271. Encode and Decode Strings
Medium

Design an algorithm to encode a list of strings to a string. The encoded string is then sent over the network and is decoded back to the original list of strings.

Machine 1 (sender) has the function:

string encode(vector<string> strs) {
  // ... your code
  return encoded_string;
}
Machine 2 (receiver) has the function:
vector<string> decode(string s) {
  //... your code
  return strs;
}

So Machine 1 does:

string encoded_string = encode(strs);

and Machine 2 does:

vector<string> strs2 = decode(encoded_string);

strs2 in Machine 2 should be the same as strs in Machine 1.

Implement the encode and decode methods.

You are not allowed to solve the problem using any serialize methods (such as eval).

 

Example 1:

Input: dummy_input = ["Hello","World"]
Output: ["Hello","World"]
Explanation:
Machine 1:
Codec encoder = new Codec();
String msg = encoder.encode(strs);
Machine 1 ---msg---> Machine 2

Machine 2:
Codec decoder = new Codec();
String[] strs = decoder.decode(msg);

Example 2:

Input: dummy_input = [""]
Output: [""]

 

Constraints:

  • 1 <= strs.length <= 200
  • 0 <= strs[i].length <= 200
  • strs[i] contains any possible characters out of 256 valid ASCII characters.

 

Follow up: Could you write a generalized algorithm to work on any possible set of characters?

Accepted
71,356
Submissions
211,823
Seen this question in a real interview before?
Companies
Related Topics

Average Rating: 4.39 (23 votes)

Premium

Solution


Approach 1: Non-ASCII Delimiter

Intuition

Naive solution here is to join strings using delimiters.

What to use as a delimiter? Each string may contain any possible characters out of 256 valid ascii characters.

Seems like one has to use non-ASCII unichar character, for example unichr(257) in Python and Character.toString((char)257) in Java (it's character ā).

fig

Here it's convenient to use two different non-ASCII characters, to distinguish between situations of "empty array" and of "array of empty strings".

Implementation

Use split in Java with a second argument -1 to make it work as split in Python.

Complexity Analysis

  • Time complexity : O(N)\mathcal{O}(N) both for encode and decode, where N is a number of strings in the input array.

  • Space complexity : O(1)\mathcal{O}(1) for encode to keep the output, since the output is one string. O(N)\mathcal{O}(N) for decode keep the output, since the output is an array of strings.


Approach 2: Chunked Transfer Encoding

Pay attention to this approach because last year Google likes to ask that sort of low-level optimisation. Serialize and deserialize BST problem is a similar example.

This approach is based on the encoding used in HTTP v1.1. It doesn't depend on the set of input characters, and hence is more versatile and effective than Approach 1.

Data stream is divided into chunks. Each chunk is preceded by its size in bytes.

Encoding Algorithm

fig

  • Iterate over the array of chunks, i.e. strings.

    • For each chunk compute its length, and convert that length into 4-bytes string.

    • Append to encoded string :

      • 4-bytes string with information about chunk size in bytes.

      • Chunk itself.

  • Return encoded string.

Decoding Algorithm

fig

  • Iterate over the encoded string with a pointer i initiated as 0. While i < n:

    • Read 4 bytes s[i: i + 4]. It's chunk size in bytes. Convert this 4-bytes string to integer length.

    • Move the pointer by 4 bytes i += 4.

    • Append to the decoded array string s[i: i + length].

    • Move the pointer by length bytes i += length.

  • Return decoded array of strings.

Implementation

Complexity Analysis

  • Time complexity : O(N)\mathcal{O}(N) both for encode and decode, where N is a number of strings in the input array.

  • Space complexity : O(1)\mathcal{O}(1) for encode to keep the output, since the output is one string. O(N)\mathcal{O}(N) for decode keep the output, since the output is an array of strings.

Report Article Issue

Comments: 27
san_py's avatar
Read More

Understanding bytes = [chr(x >> (i * 8) & 0xff) for i in range(4)]

  1. all you need to do is encode the length of x into 4 bytes ( why 4 bytes - integer size - 4 bytes = [8bits, 8bits, 8bits, 8bits])

  2. ok so how do you get a X(length of str) total size into chunks of 8 bits ?
    2.1 >> is right shift - which means if you have 101111 >> 2 - this right shift moves 101111, 2 times to the right - which
    becomes 1011
    2.2 if you go to python terminal and type 0xff you get 8 1's which represents 11111111 = a BYTE
    2.3 if you AND(&) any number with 0xff = it gives you the right most 8 bits of the number
    example: 1. bin(291) (binary of 291) is '0b100100011'
    2. oxff binary is '0b11111111'
    3. now if you 291 & 0xff = you get last 8 bits of 291 == 00100011
    If you understand this - you understood the solution.

    Now as explained in the 2 point. The python solution line 6 we take the whole length of the string (len(str)) - we have to
    encode that into [8bits, 8 bits, 8bits, 8bits]
    example: 1. lets say our total length is 291 ( in binary its bin(291) = '0b100100011')
    2. what you have to do now ? we grab the right most 8 bits - how do you grab right most 8 bits ?
    2.1 as mentioned above you do 291 & 0xff = you get last 8 bits

  3. Now you already have right most 8 bits - in next iteration you are interested in NEXT 8 bits - how do you get
    that ? we move 291 >> 8 ( which removes the last 8 bits we already computed) - which means if you do
    (291 >> 8 ) & 0xff = it gives you the next right most 8 bits

  4. as already mentioned we need 4 chunks of 8 bits [8bits, 8bits, 8bits, 8bits]
    4.1 [ for i in range(4)] thats why the range is 4 ( 0, 1, 2, 3)

    Once you compute all the 8bits we need to convert to char hence its [chr((x >> (i * 8)) & 0xff) for i in range(4)]

Hope this helps!

81
Show 7 replies
Reply
Share
Report
jrhero's avatar
Read More

what if length of string exceeds the Integer.MAX_VALUE (which means 4 butes cannot hold the length)?

16
Show 9 replies
Reply
Share
Report
xitrium's avatar
Read More

The space analysis is wrong here - just because a solution uses one string doesn't make it O(1) as strings have a variable size. Clearly the output of encode in both of these examples is linearly proportional to the size of the input.

11
Show 1 reply
Reply
Share
Report
willye's avatar
Read More

Chunked encoding is too clever for me... I dont think I can solve this in an interview lol

19
Show 4 replies
Reply
Share
Report
rocky-andre's avatar
Read More

what is "BE CodecDriver error? Python 3.* version do not have the issue atleast I do not see it. Also python 3.0 version uses chr() instead of unichr

5
Show 2 replies
Reply
Share
Report
cipone's avatar
Read More

HI, I am a little confused by this statement. For each chunk compute its length, and convert that length into 4-bytes string. It is actually a 8 bytes string since a char is 2 bytes (16 bits). And then the stored amount is sub-optimal, it uses 64 bits to store a 2**32 number.

6
Show 4 replies
Reply
Share
Report
EzMid's avatar
Read More

Isn't a char 2 bytes?

3
Show 1 reply
Reply
Share
Report
sin1080's avatar
Read More

What about just implement the full HTTP Chunked Transfer Encoding and dump length as readable text representations. If you dump integer as bytes you can easily encounter endian issues between machines and things like strict pointer aliasing violation if you are using C/C++ and are not careful enough about dark corners of the language standards (E.g. if you used an int* to dereference data stored in a char* / string, you entered the land of undefined behaviour).

3
Show 4 replies
Reply
Share
Report
rivergillis's avatar
Read More

One variation on the chunk encoding is to use a "header" to start the packet off with a list of str lengths and end with some sentinel. Because you strip the header off at the start of every decode you put whatever you want in there to help you out with decoding. This is sort of like what IP does. A network packet gets some header for transport that later gets stripped.

packet to encode: [abc, def, ghi] -> "3 3 3Aabcdefghi"
decoder strips off everything up till 'A' and uses that to figure out the string lengths.

2
Reply
Share
Report
lenchen1112's avatar
Read More

Approach 1 by using escape character

class Codec:
    def encode(self, strs: [str]) -> str:
        return str(len(strs)) + '\t' + '\t'.join(strs)

    def decode(self, s: str) -> [str]:
        length, *strs = s.split('\t')
        return [] if length == '0' else strs

Approach 2 by prefix length

class Codec:
    def encode(self, strs: [str]) -> str:
        return ''.join([f'{len(s)}:{s}' for s in strs])

    def decode(self, s: str) -> [str]:
        result = []
        while s:
            length, s = s.split(':', 1)
            result.append(s[:int(length)])
            s = s[int(length):]
        return result

3
Show 2 replies
Reply
Share
Report

You don't have any submissions yet.

/23
Contribute
|||
Saved
Trie
This list is empty.